Abstract


Computing marginal probabilities (whether prior or posterior) in Bayesian
belief networks is a hard problem. This paper discusses deterministic
approximation schemes that work by adding up the probability mass in a
small number of value assignments to the network variables. Under certain
assumptions, the probability mass in the union of these assignments
is sufficient to obtain a good approximation. Such methods are especially
useful for highly-connected networks, where the maximum clique size or
the cutset size make the standard algorithms intractable.

In considering assignments, it is not necessary to assign values to variables
that are independent of (d-separated from) the evidence and query
nodes. In many cases, however, there is a finer independence structure not
evident from the topology, but dependent on the conditional distributions
of the nodes. We note that independence-based (IB) assignments, which
were originally proposed as a theory of abductive explanations, take advantage
of such independence, and thus contain fewer assigned variables.

As  a  result,  the  probability  mass  in  each  assignment  is  greater  than 
in  the  respective  complete  assignment.  Thus,  fewer  IB  assignments  are 
sufficient,  and  a  good  approximation  can  be  obtained  more  efficiently. 
IB  assignments  can  be  used  for  efficiently  approximating  posterior  node 
probabilities  even  in  cases  which  do  not  obey  the  rather  strict  skewness 
assumptions  used  in  previous  research.  Two  algorithms  for  finding  the 
high  probability  IB  assignments  are  suggested:  one  by  doing  a  best-first 
heuristic search, and another by special-purpose integer linear programming.
Experimental results show that this approach is feasible for highly
connected  belief  networks. 
1  Introduction


Computation of marginal probabilities of variables in a Bayesian belief network
is a problem of particular research interest for the probabilistic reasoning
community. Although a polynomial-time algorithm for computing the probabilities
exists for polytrees [21], the problem was proved to be NP-hard in the general
case in [9]. Several categories of exact algorithms exist for computing posterior
probabilities: clustering and junction-trees [23, 20], conditioning [10], arc
reversal [36], and term evaluation [24]. Many variants of these algorithms attempt
various refinements of these schemes, e.g. [13]. All of these algorithms take
exponential time in the worst case, where the runtime is a function
of the topology and the number of states each variable can assume (in this paper
we only refer to networks where each variable has a finite number of states).

In the hope of avoiding an exponential runtime, a host of approximation
algorithms have emerged. As it turns out, theoretically, even approximating
marginal probabilities in belief networks is NP-hard, and thus there is no
polynomial-time (deterministic) approximation algorithm unless P=NP [11].
Most approximation algorithms are less affected by network topology, and their
runtimes and quality of approximation depend on the actual probabilities.
If the topology of a given network is such that exact algorithms are
expected to take a long runtime, it may be advisable to run an approximation
algorithm and hope that the probabilities are such that we can get a good
approximation in reasonable time for the problem instance at hand. In addition,
most approximation algorithms have an anytime behaviour, which facilitates
trading off time for precision in a graded manner.

Two major categories of marginal probability approximation algorithms exist:
randomized approximation algorithms, and deterministic approximation
algorithms. In [19], approximation is achieved by stochastically sampling
instantiations of the network variables. Later work in randomized approximation
algorithms attempts to increase sampling efficiency [4], and to handle the case
where the probability of the evidence is very low [15], which is a serious problem
for most sampling algorithms. In what follows, we focus on the second category,
deterministic approximation algorithms. In bounded conditioning [14], one uses
the conditioning method, but conditions only on a small, high probability,
subset of the (exponential size) set of possible assignments to the cutset variables.
Other approximation algorithms attempt to simplify the network by removing
arcs between nodes that are almost independent, to produce a network that is
hopefully tractable topologically. An exact algorithm is then run on the
“approximate” network, to produce an approximate answer [22]. Another source
of complexity is the large number of states per node in various applications. To
alleviate that problem, an approximation based on merging states was suggested
[41]. The scheme begins by making all variables unary valued, and successively
refines the states of variables, while performing probability updating on the
approximate network and thus getting a successively better approximation in
each step.

Another category of deterministic approximation algorithms is based on
deterministic enumeration of terms or assignments to variables in the network.
The idea is to enumerate a set of high-probability complete assignments to all
the variables in the network (but frequently partial assignments suffice, see
below). The probability of each such assignment can be computed quickly: in
O(n), or sometimes even (incrementally) in O(1). The probability of a particular
instantiation to a variable v (say $v = v_i$) is approximated by simply dividing
the probability mass of all assignments which contain $v = v_i$ by the total mass
of enumerated assignments. If only assignments compatible with the evidence
are enumerated, this gives the posterior probability of $v = v_i$. If the
enumerated assignments have a sufficiently large probability mass, then we get a good
approximation.
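The enumeration idea can be illustrated on a toy example. The sketch below is only an illustration under stated assumptions: the three-node network (rain, sprinkler, wet) and its numbers are hypothetical, and the complete assignments are ranked by brute force, which a real enumeration scheme would avoid.

```python
from itertools import product

# Hypothetical binary network: rain -> wet <- sprinkler.
NAMES = ("rain", "sprinkler", "wet")
P_RAIN = {True: 0.2, False: 0.8}
P_SPRINKLER = {True: 0.1, False: 0.9}
# P(wet = True | rain, sprinkler)
P_WET_TRUE = {(True, True): 0.99, (True, False): 0.9,
              (False, True): 0.8, (False, False): 0.05}

def joint(a):
    """Probability of one complete assignment: a product of CPT entries."""
    p_wet = P_WET_TRUE[(a["rain"], a["sprinkler"])]
    return (P_RAIN[a["rain"]] * P_SPRINKLER[a["sprinkler"]]
            * (p_wet if a["wet"] else 1.0 - p_wet))

def approx_posterior(query, evidence, k):
    """Approximate P(query = True | evidence) from the k most probable
    complete assignments that are compatible with the evidence."""
    complete = [dict(zip(NAMES, values))
                for values in product((True, False), repeat=len(NAMES))]
    compatible = [a for a in complete
                  if all(a[n] == val for n, val in evidence.items())]
    # Brute-force ranking, for illustration only.
    top = sorted(compatible, key=joint, reverse=True)[:k]
    total = sum(joint(a) for a in top)
    return sum(joint(a) for a in top if a[query]) / total
```

With all four compatible assignments (k = 4) the ratio is the exact posterior; with smaller k it is the approximation obtained from the enumerated mass only.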

In [12] the ideas of incremental operations for probabilistic reasoning were
investigated. Among them was a suggestion for approximating marginal
probabilities by enumerating high-probability terms. One interesting point is the
skewness result: if a network has a distribution such that every row in the
distribution arrays has one entry greater than $\frac{n-1}{n}$, then collecting only $n + 1$
assignments, we also have at least $\frac{2}{e}$ of the probability mass. Taking the topology
of the network into account, and using term computations, this can presumably
be achieved efficiently. However, the skewness assumption as is seems somewhat
restrictive. The assumption may hold in some domains, such as circuit
fault diagnosis, but not in medical diagnosis, or in the randomly generated
networks on which we tested our algorithms. Trying to relax the constraint, say
to probability entries greater than $(\frac{n-1}{n})^2$, already forces us to look at $O(n^2)$
assignments to get similar results.
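As a rough numerical check of the skewness argument, the sketch below assumes the threshold is (n-1)/n and takes n independent binary nodes whose dominant value has exactly that probability. This is an illustrative simplification, not the general setting of [12]. It adds up the mass of the n + 1 most probable complete assignments: the one where every node takes its dominant value, plus the n assignments where exactly one node deviates.

```python
import math

def skewed_mass(n):
    """Mass of the n + 1 most probable complete assignments for n independent
    binary nodes whose dominant value has probability (n - 1) / n."""
    p = (n - 1) / n
    best = p ** n                           # all nodes take the dominant value
    one_flip = n * (1 - p) * p ** (n - 1)   # exactly one node deviates
    return best + one_flip

# Compare against the 2/e bound for a few network sizes.
for n in (2, 5, 10, 50):
    print(n, round(skewed_mass(n), 5), round(2 / math.e, 5))
```

In this toy setting the collected mass stays just above 2/e as n grows, which is consistent with the bound quoted above.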

In  [28]  partial  assignments  to  nodes  in  the  network  are  created  from  the 
root  nodes  down.  The  probability  of  each  such  assignment  is  easily  computable. 
Much saving in computational effort is achieved by not bothering about irrelevant
nodes, i.e. nodes not above some node that is in the query set, or nodes that
are  d-separated  from  the  evidence  nodes.  Later  in  that  paper,  an  assumption  of 
extreme  probabilities  is  made.  This  is  similar  to  the  skewness  assumption  above. 
In  fact,  in  the  circuit  fault  diagnosis  experiment  in  [28],  the  numbers  actually 
used  are  well  within  the  bounds  of  the  skewness  assumption.  The  algorithm 
makes  use  of  a  conflict  scheme  in  order  to  narrow  the  search. 

It was already suggested [38, 17] that belief networks frequently have
independence structure that is not represented by the topology. Sometimes
independence holds given a particular assignment to a set of variables V, rather
than to all possible assignments to V. In such cases, the topology is no help
in determining independence (e.g. d-separation might not hold), and the actual
distributions might have to be examined. In [38] the idea of independence-based
(IB) assignments was introduced. An assignment is a set of (node, value) pairs,
which can also be written as a set of node=value instantiations. An assignment
is consistent if each node is assigned at most one value. Two assignments
are compatible if their union is consistent. Each assignment denotes a (sample
space) event, and we thus use the assignment and the event it denotes as
synonymous terms whenever this does not lead to ambiguity. An assignment A
subsumes assignment B if $A \subseteq B$.

The IB condition holds at a node v w.r.t. an assignment A if the value
assigned to v by A is independent of all possible assignments to the ancestors of
v given $A_{parents(v)}$, the assignment made by A to the immediate predecessors
(parents) of v. An assignment is IB if the IB condition holds at every
$v \in span(A)$, where span(A) is the set of nodes assigned by A. A hypercube $\mathcal{H}$ is an
assignment to a node v and some of its parents. In this case, we say that $\mathcal{H}$ is
based on v. $\mathcal{H}$ is an IB hypercube if the IB condition holds at v w.r.t. $\mathcal{H}$.
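One way to make the IB condition concrete is to test it against a conditional distribution table. The sketch below assumes a hypothetical table layout (rows keyed by tuples of parent values) and reads the condition as: the entry for the value assigned to v must be the same in every row compatible with the parents that A assigns. The OR node is an illustrative example, not taken from the paper.

```python
import math

# Hypothetical CPT for a binary OR node v with parents (a, b).
# Rows are keyed by parent values; each row maps a value of v to its probability.
OR_PARENTS = ("a", "b")
OR_CPT = {
    (0, 0): {0: 1.0, 1: 0.0},
    (0, 1): {0: 0.0, 1: 1.0},
    (1, 0): {0: 0.0, 1: 1.0},
    (1, 1): {0: 0.0, 1: 1.0},
}

def ib_condition_holds(node, parents, cpt, assignment):
    """True if the CPT entry for the assigned value of `node` is identical in
    every row compatible with the parent values given by `assignment`."""
    value = assignment[node]
    entries = [
        row[value]
        for parent_values, row in cpt.items()
        # An unassigned parent is compatible with any value in the row.
        if all(assignment.get(p, pv) == pv
               for p, pv in zip(parents, parent_values))
    ]
    return all(math.isclose(e, entries[0]) for e in entries)
```

For the OR node, {v = 1, a = 1} satisfies the condition without assigning b, whereas {v = 0, a = 0} does not, since the entry for v = 0 still depends on b.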

In [38], IB assignments were the candidates for relevant explanation. Here,
we suggest that computing marginal probabilities (whether prior or posterior)
can be done by enumerating high-probability IB assignments, rather than
complete assignments. Since IB assignments usually have fewer variables assigned,
each IB assignment is expected to hold more probability mass than a respective
complete (or even a query and evidence supported) assignment. The probability
of an IB assignment is also easy to compute [38]:

$$P(A) = \prod_{v \in span(A)} P(A_{\{v\}} \mid A_{parents(v)}) \tag{1}$$

where $A_S$ is the assignment A restricted to the set of nodes S. The terms in the
product can each be retrieved in O(1) from the conditional distribution array
(or other representation) of the node conditional distribution.
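A minimal sketch of equation (1), assuming a hypothetical network representation in which each node maps to its parent tuple and a table of rows keyed by parent values. For an IB assignment every row compatible with the assigned parents carries the same entry for the assigned value, so the first compatible row is used. The scan over rows is for brevity; an indexed table would give the O(1) lookup mentioned above.

```python
# Hypothetical network: roots a and b, and v = OR(a, b).
NETWORK = {
    "a": ((), {(): {0: 0.7, 1: 0.3}}),
    "b": ((), {(): {0: 0.4, 1: 0.6}}),
    "v": (("a", "b"), {
        (0, 0): {0: 1.0, 1: 0.0},
        (0, 1): {0: 0.0, 1: 1.0},
        (1, 0): {0: 0.0, 1: 1.0},
        (1, 1): {0: 0.0, 1: 1.0},
    }),
}

def ib_assignment_probability(assignment, network):
    """Equation (1): product over assigned nodes of P(value | assigned parents).
    Assumes `assignment` is IB, so all compatible rows agree on the entry."""
    prob = 1.0
    for node, value in assignment.items():
        parents, cpt = network[node]
        for parent_values, row in cpt.items():
            if all(assignment.get(p, pv) == pv
                   for p, pv in zip(parents, parent_values)):
                prob *= row[value]
                break  # any compatible row will do for an IB assignment
    return prob
```

For example, the IB assignment {v = 1, a = 1} gets probability 1.0 * 0.3 = 0.3 without assigning b, which equals the true probability of the event it denotes.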

One might argue that searching for high-probability assignments for
approximating marginal distributions is a bad idea, since coming up with the
highest-probability assignment is NP-hard [39]. Thus, we are solving an NP-hard
problem in order to find an approximate solution to another NP-hard problem, where we
might expect that a polynomial time algorithm can be sufficient to compute
approximations. However, as noted above, [11] showed that the approximation problem is also
NP-hard. Therefore, using this kind of approximation algorithm is a reasonable
proposition, provided that some sub-classes of the problem that are bad
for existing algorithms can be shown to behave well, either theoretically or by
empirical results that show good behaviour on the average for some classes of
problems. Since runtimes of our algorithms depend in a complicated manner on
the conditional probabilities, it is very hard to get any theoretical bounds on
the runtime for interesting classes of networks. In this and related papers, we
thus take the experimental route to justify our performance claims.

The rest of the paper is organized as follows: section 2 discusses the details
of how to approximate posterior probabilities from a set of high-probability
IB assignments. Section 3 reviews the IB MAP search algorithm of [38], and
discusses a faster heuristic best-first algorithm for finding the high-probability
IB assignments, based on the heuristic presented in [7]. Section 4 reviews the
reduction of IB MAP computation to linear systems of equations [38],
incorporating a few improvements that reduce the number of equations. Searching for
next-best assignments using linear programming is discussed. Section 5 presents
experimental timing results for approximation of posterior probabilities on
random networks. We conclude with other related work and an evaluation of the
IB MAP methods.


2  Computing Marginal Probabilities with IB Assignments


The probability of a certain node instantiation, $v = v_i$, is approximated by the
probability mass in the IB assignments containing $v = v_i$ divided by the total
mass. If we need to find the probability of v, then v is a query node. Nodes
where evidence is introduced are called evidence nodes. We also assume that
the evidence is conjunctive in nature, i.e. it is an assignment of values to the
evidence nodes. We need to assume that each enumerated IB assignment A
contains some assignment to query node v. Otherwise, it might be impossible
to tell which part of the mass of A supports $v = v_i$. Let us assume for now
that this is indeed the case, i.e. we have a set I containing IB assignments, and
if $A \in I$ then $v \in span(A)$. Thus, to compute the (approximate) probability of
$v = v_i$, we compute:


$$P_a(v = v_i) = \frac{P(\{A \mid A \in I \wedge \{v = v_i\} \subseteq A\})}{P(\{A \mid A \in I\})}$$


where the probability of a set of assignments is the probability of the event
that is the union of all the events standing for all the assignments (not the
probability of the union of the assignments). If we are computing the prior
probability of $v = v_i$, we can either assume that the denominator is 1 (and not
bother about assignments assigning v a value other than $v_i$), or use
$1 - P(\{A \mid A \in I\})$ as an error bound. If all IB assignments are disjoint, the probability of the
union is easily computable, and is simply the sum of probabilities of the IB
assignments.

However, since IB assignments are partial, it is possible for the events
denoted by two different IB assignments to overlap. For example, let {u, v, w} be
nodes, each with a domain {1, 2, 3}. Then A = {u = 1, v = 2} has an
overlap with B = {u = 1, w = 1}. The overlap $C = A \cup B$ is also an assignment:
C = {u = 1, v = 2, w = 1}.^1 Thus, to compute the probability of a set, some
other method must be used.

Computing  the  union  of  the  IB  assignments  in  a  representation  that  makes 
computation  of  the  probabilities  easy  is  non-trivial.  It  turns  out  that  we  can 
use  the  inclusion-exclusion  principle,  due  to  the  following  property: 

^1 Note that for two assignments A, B, the union of A and B denotes the event that is the
intersection of the events denoted by A and B.
Theorem 1  Let A, B be compatible IB assignments. Then $A \cup B$ is also an IB
assignment.

Proof: Let $C = A \cup B$. Clearly, C is a consistent assignment. It suffices to
show that for each node $v \in span(C)$, the IB condition holds at v w.r.t. C.
Without loss of generality, let $v \in span(A)$. Then v is independent of all of its
ancestors given $A_{parents(v)}$. Now, $A_{parents(v)} \subseteq C_{parents(v)}$, and thus, by a
variant of weak union ([26] page 84), v is independent of all of its ancestors not
in $span(C) \cap parents(v)$ given $C_{parents(v)}$.

Despite  this  theorem,  evaluating  the  probability  of  a  set  of  IB  assignments 
may  require  the  evaluation  of  an  exponential  number  of  terms.  That  is  due  to 
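the equation for implementing the inclusion-exclusion principle:

$$P\Big(\bigcup_{i=1}^{m} E_i\Big) = \sum_{t=1}^{m} (-1)^{t+1} \sum_{1 \le a_1 < \cdots < a_t \le m} P(E_{a_1} \cap \cdots \cap E_{a_t})$$

where $E_i$ is the i-th event. Several ways exist to overcome this problem, which we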
review  in  short  in  the  discussion  section.  In  the  description  of  the  algorithms, 
this  issue  is  temporarily  ignored  for  simplicity. 
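The inclusion-exclusion computation over a set of IB assignments can be sketched as follows. The intersection of the events of several assignments is the event of their union (an IB assignment by Theorem 1), or the empty event if they are incompatible. The scoring function `uniform_prob` is a hypothetical stand-in for equation (1): it assumes independent nodes, each uniform over three values, so that every partial assignment is trivially IB.

```python
from itertools import combinations

def union_of(assignments):
    """Union of several assignments as one dict, or None if incompatible."""
    merged = {}
    for a in assignments:
        for node, value in a.items():
            if merged.setdefault(node, value) != value:
                return None
    return merged

def probability_of_set(assignments, prob):
    """P(E_1 or ... or E_m) by inclusion-exclusion over all non-empty subsets;
    `prob` returns the probability of a single (IB) assignment."""
    total = 0.0
    for t in range(1, len(assignments) + 1):
        sign = 1.0 if t % 2 == 1 else -1.0
        for subset in combinations(assignments, t):
            merged = union_of(subset)
            if merged is not None:      # incompatible subsets denote the empty event
                total += sign * prob(merged)
    return total

def uniform_prob(assignment):
    """Toy scoring: independent nodes, each uniform over a 3-value domain."""
    return (1.0 / 3.0) ** len(assignment)
```

For A = {u = 1, v = 2} and B = {u = 1, w = 1} this gives 1/9 + 1/9 - 1/27 = 5/27. The loop visits all 2^m - 1 non-empty subsets, which is the exponential cost noted above.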

How many of the highest probability IB assignments are needed in order to
get a good approximation? Obviously, in the worst case the number is
exponential in n. However, under the skewness assumption [12] (reviewed in section
1) the number is small. In fact, it follows directly from the skewness theorem
[12] that if the highest (or second highest) probability complete assignment is
compatible with $A_{opt}$, the highest probability IB assignment, and $A_{opt}$ has at
least $\log_2 n$ unassigned nodes, then the 2 highest IB assignments contain most
($\ge \frac{2}{e}$) of probability mass. It is possible to extend the skewness theorem
to include $O(n^k)$ terms, in which case the mass will be at least [...], where
$T_k(x)$ is the polynomial consisting of the first k terms of the Taylor expansion of
$e^x$. Thus, under the above conditions, if $A_{opt}$ has $(k + 1)\log_2 n$ unassigned
nodes, the highest probability IB assignment will contain at least [...] of the
probability mass.

Given  a  set  of  query  and  evidence  nodes,  all  non-supported  (redundant) 
nodes  can  be  dropped  from  the  diagram.  A  node  v  is  supported  by  a  set  of 
nodes  V  if  it  is  in  V  or  if  v  is  an  ancestor  of  some  node  in  V.  A  node  supported 
by  the  evidence  nodes  is  called  evidentially  supported,  and  a  node  supported 
by  a  query  node  is  called  query  supported.  We  are  usually  only  interested  in  IB 
assignments  properly  evidentially  supported  by  some  set  of  evidence  nodes.  An 
assignment  is  properly  evidentially  supported  if  all  the  nodes  in  the  assignment 
have  a  directed  path  of  assigned  nodes  to  an  evidence  node.  Likewise,  an  IB 
assignment  is  properly  query  supported  if  every  node  in  the  assignment  obeys 
the  above  condition  w.r.t.  query  nodes. 

Before we start searching for IB assignments, we drop all evidence nodes
that are d-separated from the query nodes, as well as all the nodes that are not
either query supported or supported by one of the remaining evidence nodes.
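The support-based part of this pruning step amounts to keeping only the query and evidence nodes and their ancestors. The sketch below assumes a hypothetical representation of the network as a dict from each node to the list of its parents. It does not implement the d-separation test for evidence nodes, which would have to be applied beforehand.

```python
def supported_nodes(parents, targets):
    """Nodes supported by `targets`: the targets and all of their ancestors."""
    seen = set()
    stack = list(targets)
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, ()))
    return seen

def prune(parents, query_nodes, evidence_nodes):
    """Drop every node that is neither query supported nor evidentially
    supported. Parents of a kept node are ancestors, so they are kept too."""
    keep = supported_nodes(parents, set(query_nodes) | set(evidence_nodes))
    return {node: ps for node, ps in parents.items() if node in keep}

# Hypothetical example: a -> b -> c and a -> d -> e.
PARENTS = {"a": [], "b": ["a"], "c": ["b"], "d": ["a"], "e": ["d"]}
```

With query node b and evidence node d, nodes c and e are redundant and are removed.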

We are now ready to describe the basic anytime best-first search algorithm.
The existence of a generator is assumed. Each time the generator is called, it
returns the next-best (next highest probability) IB assignment consistent with
a set of initial assignments. Some variants of the algorithm use more than one
generator instance.

•  Input: a Bayesian belief network B, evidence $\mathcal{E}$ (a consistent assignment),
a query node q.

•  Output: successively improved approximations for $P(q = q_i)$, for each
value $q_i$ in the domain of node q.

1.  Preprocessing

•  Sort the nodes of B such that no node appears after any of its
ancestors.

•  Initialize the IB hypercubes for each node $v \in B$.

2.  Initializing: remove redundant nodes, and for each $q_i$ in the domain of q
do:

(a)  Set up a result set for $q_i$.

(b)  Add the assignment $\mathcal{E} \cup \{q = q_i\}$ to the initial assignment set for the
generator.

3.  Repeat until time limit or generator returns null:

(a)  Get next-best IB assignment A from the generator.

(b)  Add A to the result set of $q_i$, where $\{q = q_i\} \subseteq A$.

(c)  Update the posterior probability approximation.
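Step 3 of the algorithm can be sketched as a short anytime loop. This is a simplified illustration under stated assumptions: the generator is any iterable yielding (assignment, probability) pairs in decreasing order of probability, a step budget `max_steps` stands in for the time limit, and the yielded IB assignments are assumed to be disjoint and to have positive probability, so that the mass of a result set is a plain sum.

```python
from itertools import islice

def anytime_posterior(generator, query, domain, max_steps):
    """Yield successively refined approximations of P(query = q_i | evidence).

    `generator` yields (assignment, probability) pairs, best first; each
    assignment is a dict that assigns a value to `query`.
    """
    result_mass = {q: 0.0 for q in domain}            # step 2(a): one result set per q_i
    for assignment, prob in islice(generator, max_steps):
        result_mass[assignment[query]] += prob        # step 3(b)
        total = sum(result_mass.values())
        yield {q: mass / total for q, mass in result_mass.items()}  # step 3(c)

# Hypothetical stream of disjoint IB assignments for a binary query node q.
STREAM = [({"q": 1, "x": 0}, 0.4), ({"q": 0}, 0.3), ({"q": 1, "x": 1}, 0.1)]
```

Because the function is itself a Python generator, a caller can stop consuming approximations at any point, which mirrors the anytime behaviour of the algorithm.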

The  simplest  generator  is  a  best  first  search  with  the  current  probability 
heuristic,  which  is  exactly  the  inner  loop  of  the  algorithm  in  [38],  (also  described 
in  the  following  section).  In  this  paper,  we  also  look  at  two  other  generators: 
a  best-first  search  algorithm  based  on  the  cost-sharing  heuristic,  and  an  integer 
linear  program  scheme,  modified  from  [38]. 

The posterior probability approximation for $q = q_i$ given the evidence is:

$$P_a(q = q_i \mid \mathcal{E}) = \frac{P(\text{result set for } q_i)}{\sum_j P(\text{result set for } q_j)}$$

As before, for null evidence, $1 - \sum_j P(\text{result set for } q_j)$ is the unassigned
probability mass, and can be used to bound the error, as in [28]. In order to bound
the error for non-null evidence, we evaluate the probability that the evidence
is false by using the same scheme. That is, add to the initialization set the
assignment FALSE to the evidence node^2, and create a result set for it.

Note  that  the  preprocessing  need  only  be  done  once  for  each  network,  and 
the  results  can  be  used  for  different  query  and  evidence  sets.  Alternately,  it 
is  possible  to  do  most  of  the  preprocessing  incrementally  by  moving  it  into 
the  loop,  and  initialize  the  hypercubes  for  a  node  only  when  we  first  try  to 
expand  it  (i.e.  inside  the  generator).  This  way,  the  algorithm  can  sometimes 
start  providing  answers  before  initializing  all  the  hypercubes.  In  fact,  it  is 
not  even  necessary  that  the  belief  network  be  explicitly  represented  in  entirety. 
Applications  which  construct  belief  networks  incrementally  (such  as  WIMP  [5]) 
might  benefit  from  not  having  to  generate  parts  of  the  network  unless  needed 
for  abductive  conclusions. 

A variant of the algorithm uses several generators, one for each assignment
in the above described initialization sets (thus each generator now gets an
initialization set of size 1). Thus, there is one generator for the negation of the
evidence, and one generator for each state of the query node. In each
approximation step, get one next-best IB assignment from each generator, and proceed
to update the marginal probability estimate. We believe that this version of the
algorithm achieves a better balance for the case where the posterior probability
of some state of the query node is low, and thus a better relative approximation.
This issue is orthogonal to the actual algorithm used to find the IB assignments,
and is ignored henceforth.

To generalize this algorithm to m query nodes, it is sufficient to initialize
a result set for every state in the domain of each node (not the cross product,
which is what we would do if we wanted to find the posterior joint probability
of the query nodes), and to add to the initialization set an assignment for each
such state. When an IB assignment is found, it is added into m result sets, one
for each query node. To evaluate the probability approximation, divide by the
probability of the set of assignments collected for only one of the nodes (any
one will do).

Experimental results from [38] suggest that at least the highest probability
IB assignment (the IB-MAP) can be found in reasonable time for medium-size
networks (up to 100 nodes), but that problems start occurring for many
instances of larger networks. Experiments for finding several of the next-best IB
assignments are reviewed in section 5, and timing results tend to indicate that
next-best assignments are found rather quickly after the first one. However, we
would still like to see a faster algorithm. The method of using IB assignments
to approximate posterior probabilities can be divorced from the search method
(the generator). Any generator providing the IB assignments in the correct
order will do. In the next section, we discuss how the linear program techniques
used in [35, 38] can be used to deliver IB assignments in decreasing order of
probability, for posterior probability approximation.

^2 We can assume without loss of generality that the evidence consists of a single binary
valued node. Otherwise, create a new, deterministic, binary valued node with all the evidence
nodes as parents, with the requisite function (usually the new node is TRUE if all parents
are instantiated as in the evidence, FALSE otherwise, i.e. the new node is a generalized AND
node).


3  Heuristic Search IB-Assignment Generators

In this section, we present best-first heuristic search generators for the marginal
probability approximation algorithms. We begin with the simple, current
probability heuristic algorithm [38], and then discuss the better cost-sharing
heuristic. The best-first algorithm keeps a sorted agenda of states, where a state is
an assignment, a node last expanded, and a probability estimate:

•  Input: a Bayesian belief network B, a set of consistent assignments E

•  Output: The next best IB assignment, consistent with some $\mathcal{E} \in E$

1.  Initializing: for each $\mathcal{E}$ in E, push into the agenda the assignment $\mathcal{E}$ with
a probability estimate of 1.

2.  Repeat until empty agenda:

(a)  Pop assignment with highest estimate A from the agenda, and remove
duplicate assignments (they will all be at the top of the agenda).

(b)  If A is IB, return it.

(c)  Otherwise, expand A at v, the next node, into a set of assignments
S, and for each assignment $A^j \in S$ do:

i.  Estimate the probability of $A^j$.

ii.  Push $A^j$ with its probability estimate and last-expanded node v
into the agenda.

When the generator is resumed (i.e. called after it returns the first time), the
resumption point is at step 2. Expanding a state and the probability estimate
is exactly as in [38]: $A^j = A \cup \mathcal{H}^j$, where $\mathcal{H}^j$ is the j-th IB hypercube based on
v that is maximal w.r.t. subsumption and consistent with A. The probability
estimate is the product of hypercube probabilities for all nodes where the IB
condition holds.
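The control structure of the generator can be sketched with a priority queue. This is a skeleton under stated assumptions, not the implementation of [38]: the hooks `is_ib`, `expand`, and `estimate` are hypothetical and must be supplied by the caller, duplicates are filtered with a set instead of being popped together from the top of the agenda, and the last-expanded node is left to the `expand` hook. A Python generator resumes inside the loop, which matches the resumption point at step 2.

```python
import heapq
from itertools import count

def best_first_ib(initial, is_ib, expand, estimate):
    """Yield (assignment, estimate) pairs, highest probability estimate first."""
    tie = count()  # tie-breaker so that dicts are never compared by the heap
    agenda = [(-1.0, next(tie), dict(a)) for a in initial]   # step 1
    heapq.heapify(agenda)
    seen = set()
    while agenda:                                            # step 2
        neg_est, _, a = heapq.heappop(agenda)                # step 2(a)
        key = frozenset(a.items())
        if key in seen:
            continue                                         # duplicate assignment
        seen.add(key)
        if is_ib(a):                                         # step 2(b)
            yield a, -neg_est
        else:                                                # step 2(c)
            for child in expand(a):
                heapq.heappush(agenda, (-estimate(child), next(tie), child))

# Toy hooks: two independent binary nodes; an assignment counts as finished
# (stands in for "is IB") once both nodes are assigned.
TOY_P = {"x": {0: 0.7, 1: 0.3}, "y": {0: 0.6, 1: 0.4}}

def toy_is_ib(a):
    return len(a) == len(TOY_P)

def toy_expand(a):
    node = next(n for n in TOY_P if n not in a)
    return [{**a, node: value} for value in TOY_P[node]]

def toy_estimate(a):
    est = 1.0
    for node, value in a.items():
        est *= TOY_P[node][value]
    return est
```

On the toy hooks the generator returns the four complete assignments in decreasing order of probability, since the estimate never underestimates the probability of any extension.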

We now consider improving the performance of our search algorithm by using
a different heuristic than current probability (which is essentially the same as
“cost-so-far” [40]). The idea is that cost-so-far gives little information early on
in the search, while including costs that will be incurred later on (higher up in
the DAG) gives us a better estimate. However, one cannot just add the costs
to be incurred in the future, because in multiply connected networks one node
cost (negative logarithm of probability) would be counted multiple times, and
we would no longer have an admissible heuristic. The idea of dividing the cost to
be incurred by the number of children, the “cost-sharing” heuristic, was pursued
in [7] for proof graphs (AND/OR DAGs). The cost sharing heuristic showed a
marked improvement in performance over the cost-so-far heuristic when applied
to graphs generated by WIMP [6]. Since the above generator is a best-first
search algorithm that uses the cost-so-far heuristic, plugging in the cost-sharing
heuristic ought to give us a great improvement in performance, assuming that
it can be applied to IB assignment search.

The cost-sharing heuristic is admissible only if the expansion operator is over
edges (rather than nodes), and obeys the minimal cut property. A cut of an
AND DAG (a directed acyclic graph containing only AND nodes) is a set of
edges E such that every path from any root node to a leaf node contains an
edge from E. A cut is minimal if it is setwise minimal, i.e. if no edge can be
removed from E such that there is still a cut. (What we call a “minimal cut”
here is the same as a “cut” in [7].) For an AND/OR DAG D, E is a cut if
D contains a complete AND DAG (intuitively: completely specified proof) for
which E is a cut. Likewise for a minimal cut of an AND/OR DAG.

In order to use the heuristic, we must first convert our problem into a
weighted AND/OR DAG (WAODAG), and then provide an edge-based operator
that preserves the minimal cut property. We must also show that probabilities
are equivalent to costs in the WAODAG. If we do all of the above, we are assured
by the results of [7], that the heuristic is admissible for our search, and that this
algorithm variant indeed comes up with the highest-probability IB assignment.

To convert our problem into the WAODAG formulation, we perform a
construction similar to [8] or [40]. The algorithm is given a belief network
B = (V, A, P), and evidence $\mathcal{E}$.^3 We construct a WAODAG $W(B, \mathcal{E}) = (G, c, l, s)$
that is equivalent to the original belief network and evidence. In a WAODAG,
G is the underlying DAG, l is the label of a node, which is either AND or OR,
c is the cost function, and s is the sink node. The construction is as follows,
except for the costs, which are discussed later on:

•  For  each  possible  node-state  {v,d)  £€u  {(r,  d)  |  v  G  V  -  span(£’)d  G 
construct  a  OR  node  W”  .  Note  that  for  each  evidence  node,  only  one 
state  is  possible. 

•  Construct  an  AND  node  S,  with  pMents  N”*  for  all  (v,  d)  G  £,  i.e.  all  the 
evidence  node-states. 

• For each maximal IB hypercube based on any node $v$ that assigns value $d$
to $v$, construct an AND node $H_i^{v^d}$, where $i$ is the index of the hypercube
among all the hypercubes that are based on $v$ and assign the value $d$ to
$v$ (the actual order is immaterial). We call the node $H_i^{v^d}$ the image of the
hypercube, and use it as a synonym for the hypercube itself whenever
unambiguous.

(Footnote: Note that query nodes essentially become evidence nodes in the context of searching for
the best IB assignment. Additionally, we can assume without loss of generality that all nodes
are either evidence or query nodes, or ancestors of some such node; otherwise they can just
be dropped from the diagram.)

• For each maximal IB hypercube as above, construct a node $SC_i^{v^d}$, the
hypercube’s “self cost” node.

• Construct a directed edge from each hypercube $H_i^{v^d}$ to $N^{v^d}$.

• Each hypercube assigns some state to some of a node’s parents. Let $A_i^{v^d}$
be the set of parent node-state pairs in hypercube $H_i^{v^d}$. Construct an edge from
each node $N^{u^{d'}}$ with $(u, d') \in A_i^{v^d}$ to $H_i^{v^d}$.

• From each self-cost node $SC_i^{v^d}$ construct a directed edge $ESC_i^{v^d}$ (self-cost
edge) to the respective $H_i^{v^d}$. Node $SC_i^{v^d}$ can be viewed as an AND node
with no parents.

Let $AD$ be a complete AND DAG in $W$, and let $\mathcal{H}$ be the set of hypercubes
in $AD$. Define $a(AD)$ to be the assignment consisting of the union of all
the hypercubes in $\mathcal{H}$. Let $C$ be a (minimal) cut of $W$ consisting only of
the self-cost edges of a set of hypercubes $\mathcal{H}$. As before, define $a(C)$ as the
assignment consisting of the union of all the hypercubes in $\mathcal{H}$. Likewise, if $C$ is
a set of edges (not necessarily a cut or a minimal cut), define $a(C)$ to be the
union of all the hypercubes whose self-cost edge is in $C$.

We now complete the definition of the WAODAG by defining the cost function.
The cost of a self-cost node is $c(SC_i^{v^d}) = -\log P(H_i^{v^d})$. The rest of the costs
are defined as in [7], as follows:

• The cost of a self-cost edge is the cost of its source node, i.e. $c(ESC_i^{v^d}) =
c(SC_i^{v^d}) = -\log P(H_i^{v^d})$.

• The cost of a hypercube $c(H_i^{v^d})$ is the sum of the costs of the incoming
edges.

• The cost of each edge with a hypercube node as a source is the cost of its
source hypercube.

• The cost of any other edge is the cost of its source node $N^{v^d}$ divided by
the number of children that $v$ has in $B$.

• The cost of a node-state node $N^{v^d}$ is the minimum of the costs of all of
its incoming edges.

Since  W  is  a  DAG,  and  the  above  defines  costs  only  in  terms  of  the  belief 
network  or  edges  and  nodes  above,  this  definition  is  well  grounded  (i.e.  is  not 
circular). 
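The cost rules above can be traced on a toy graph. The following Python sketch is illustrative only: the three-node network (A with children B and C), the hypercube names, and all probabilities are invented, and the dictionary layout is an assumption rather than the authors' implementation. It shows how dividing a node-state cost by the number of children makes the shared cost of A add up exactly once.

```python
import math
from functools import lru_cache

# Hypothetical network A -> B, A -> C; every number below is made up.
CHILDREN = {"A": 2, "B": 0, "C": 0}  # number of children of each node in B

# name -> (node the hypercube is based on, value, parent states, probability)
HYPERCUBES = {
    "HA0": ("A", 0, (), 0.7),
    "HA1": ("A", 1, (), 0.3),
    "HB1": ("B", 1, (("A", 1),), 0.9),
    "HC1": ("C", 1, (("A", 1),), 0.8),
}


@lru_cache(maxsize=None)
def hypercube_cost(name):
    """Sum of incoming edge costs: the self-cost edge plus one edge per parent state."""
    _, _, parent_states, p = HYPERCUBES[name]
    cost = -math.log(p)  # self-cost edge
    for u, d in parent_states:
        # Edge from a node-state node: its cost is shared among the children of u.
        cost += node_state_cost(u, d) / CHILDREN[u]
    return cost


@lru_cache(maxsize=None)
def node_state_cost(v, d):
    """Minimum over incoming edges; each such edge costs as much as its source hypercube."""
    return min(
        hypercube_cost(h)
        for h, (based_on, value, _, _) in HYPERCUBES.items()
        if based_on == v and value == d
    )


# With evidence B = 1 and C = 1, the two node-state costs sum to
# -log(0.9 * 0.8 * 0.3): the cost of A = 1 is counted once, half per child.
total = node_state_cost("B", 1) + node_state_cost("C", 1)
```

In this toy case `math.exp(-total)` equals the probability of the assignment {A = 1, B = 1, C = 1} restricted to the three selected hypercubes.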
The generator keeps an agenda of states, where a state is a set of edges, a
minimal cut $C$. For convenience and efficiency, we also keep the hypercubes,
last expanded node, current heuristic value, etc., but these can all be computed
from the cut $C$. The heuristic value of a state is the sum of its edge costs.
Our expansion operator $\sigma$ (a function from a set of edges to a set of sets of
edges, i.e. essentially from a state to a set of next states) is the same as in [7],
except that when an OR node is expanded to include an edge from a hypercube
$H_i^{v^d}$ to $N^{v^d}$, we expand the hypercube node (which is an AND node) as well
in the same expansion step. This does not affect either the heuristic value or
the reachability of final states. Hence, to prove that this heuristic is admissible,
the results of [7] can be directly applied. We begin by presenting the algorithm
for searching for IB MAP. Its extension to compute next best IB assignments
and posterior probabilities is exactly the same as for the cost-so-far heuristic
search. To use this algorithm as a generator, we assume that there is only one
assignment in the initialization set, and call it the evidence.

• Input: a Bayesian belief network $B$, and a consistent assignment $\mathcal{E}$, the
evidence.

•  Output:  Next  best  IB  assignment. 

1.  Initializing 

(a)  Remove  redundant  nodes. 

(b)  Create  the  WAODAG  from  the  top  down,  while  computing  node  and 
edge  costs. 

(c) Create a dummy outgoing edge $e$ from $S$, with cost equal to $c(S)$
(the dummy “edge” does not even need to have a sink).

(d) Push the singleton set $\{e\}$ onto the agenda, with heuristic cost equal
to $c(e)$.

2.  Repeat  until  empty  agenda: 

(a)  Pop  state  with  lowest  cost  estimate  C  from  the  agenda,  and  remove 
duplicate  states  (they  will  all  be  at  the  top  of  the  agenda). 

(b)  If  the  assignment  is  IB  (all  edges  are  self  cost  edges),  return  it. 

(c) Otherwise, find the earliest node $N^{v^d}$ which is a source of an edge $e$
in $C$, and compute $\sigma(C)$. That is, expand $C$ at $e$ (we also call this
“expanding at node $v$”) into a set of states $\mathcal{C}$. For each state $C' \in \mathcal{C}$
do the following:

i. Find if $C'$ corresponds to a consistent assignment, and if not,
discard it.

ii. Compute the heuristic value for $C'$ as the sum of edge costs.

iii. Push $C'$ into the agenda.

Expanding a state $C$ at $e$ with source $N^{v^d}$ (computation of $\sigma$) works as
follows: Let $E$ be the set of edges with source $N^{v^d}$ in $C$. The parents of
$N^{v^d}$ are $H_i^{v^d}$ with $1 \le i \le \mathrm{indegree}(N^{v^d})$. Then the new states $C_i$, $1 \le i \le
\mathrm{indegree}(N^{v^d})$, are $C_i = C - E + \mathrm{incoming}(H_i^{v^d})$. Note that by construction,
each application of the expansion adds one self-cost edge to $C$, thereby adding a
hypercube to the assignment defined by the cut. Each of the new states amounts
to a different choice of hypercube at $v$, just like the cost-so-far algorithm. In
[33], we show that the algorithm is correct.
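The search loop and the expansion $C_i = C - E + \mathrm{incoming}(H_i^{v^d})$ can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the toy network (A with single child B, evidence B = 1), all names and probabilities are hypothetical, the search starts from the edge entering the sink rather than from the dummy edge, every node-state cost is divided by one (each node is treated as having a single child), and "earliest node" is replaced by alphabetical order.

```python
import heapq
import itertools
import math

# Hypothetical hypercubes: name -> (node-state pairs it assigns, probability).
HYPERCUBES = {
    "HA0": ({"A": 0}, 0.7),
    "HA1": ({"A": 1}, 0.3),
    "HB1_1": ({"B": 1, "A": 1}, 0.9),
    "HB1_2": ({"B": 1, "A": 0}, 0.2),
}
# Parent hypercubes of each node-state (OR) node.
OR_PARENTS = {"N_A0": ["HA0"], "N_A1": ["HA1"], "N_B1": ["HB1_1", "HB1_2"]}
# Node-state nodes feeding each hypercube (AND) node, besides its self-cost node.
AND_PARENTS = {"HA0": [], "HA1": [], "HB1_1": ["N_A1"], "HB1_2": ["N_A0"]}


def incoming(h):
    """Edges entering hypercube h: its self-cost edge plus its parent-state edges."""
    return {("SC_" + h, h)} | {(n, h) for n in AND_PARENTS[h]}


def edge_cost(edge):
    src, _ = edge
    if src.startswith("SC_"):
        return -math.log(HYPERCUBES[src[3:]][1])
    # Node-state node: cheapest parent hypercube (divided by one child here).
    return min(sum(edge_cost(e) for e in incoming(h)) for h in OR_PARENTS[src])


def assignment(cut):
    """Union of hypercubes whose self-cost edge is in the cut; None if inconsistent."""
    merged = {}
    for src, h in cut:
        if src.startswith("SC_"):
            for var, val in HYPERCUBES[h][0].items():
                if merged.setdefault(var, val) != val:
                    return None
    return merged


def expand(cut, node):
    """One new state per parent hypercube H_i of `node`: C - E + incoming(H_i)."""
    removed = {e for e in cut if e[0] == node}
    return [frozenset((cut - removed) | incoming(h)) for h in OR_PARENTS[node]]


def best_ib_assignment(initial_cut):
    tie = itertools.count()  # tie-breaker so the heap never compares cuts
    start = frozenset(initial_cut)
    agenda = [(sum(map(edge_cost, start)), next(tie), start)]
    seen = set()
    while agenda:
        cost, _, cut = heapq.heappop(agenda)
        if cut in seen:
            continue  # duplicate state
        seen.add(cut)
        open_nodes = sorted(src for src, _ in cut if not src.startswith("SC_"))
        if not open_nodes:  # only self-cost edges remain: the assignment is IB
            return assignment(cut), cost
        for nxt in expand(cut, open_nodes[0]):
            if assignment(nxt) is not None:  # discard inconsistent states
                heapq.heappush(agenda, (sum(map(edge_cost, nxt)), next(tie), nxt))
    return None, math.inf


result, cost = best_ib_assignment({("N_B1", "S")})
```

In this toy run the first state popped with only self-cost edges selects hypercubes HB1_1 and HA1, so `math.exp(-cost)` is the product of their probabilities.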

This algorithm diverges from that of [7] in that when an edge from an OR
node is expanded, the AND node which is its parent is also expanded immediately,
in the same expansion step. This does not affect the correctness of the
algorithm, since every consistent self-cost-only cut $C$ either is reachable, or
$a(C)$ is subsumed by some IB assignment $a(C')$, where $C'$ is reachable.

The algorithm also diverges from [7] in that the cost of an edge is the cost
of its source node $N^{v^d}$ divided by the out-degree of $v$ in $B$, rather than the
out-degree of $N^{v^d}$ in $W$. That is permissible because, for each child $u$ of $v$ in $B$,
only one of the nodes $H_i^{u^{d'}}$ (over all values $d'$ and hypercube indices $i$) appears
in any AND DAG, and thus the cost is only shared among at most outdegree($v$)
edges. This argument is similar to the one presented in the conclusion of [7].

Note that it is possible for an edge $e$ with source $v$ to exist in $C$, where
actually the IB condition holds at $v$ w.r.t. $a(C)$. In this case, we still need to
expand $e$, but it does not matter which next state is selected, since they all result
in the same assignment (counting only consistent assignments). In the actual
implementation, to prevent the creation of too many duplicate states, the first
one of these that is consistent is pushed into the agenda, and all the others are
discarded. While the selection of edges may affect the search in that the cost
estimation may be different, it cannot affect the final result in this case.

Finally, how is this generator to be used in the marginal probability approximation
algorithm? In the variant where each generator gets a singleton set as an
initialization set, the generator needs no modification. In the variant where one
generator is used for all states, only minor modifications are needed, as follows.
First, we need to create $S_i$, a copy of the node $S$, for each assignment $\mathcal{E}_i$ in the
initialization set, and create the WAODAG accordingly. One dummy edge $e_i$
is created for each node $S_i$, and one agenda item is created initially for each
$\mathcal{E}_i$. No further modifications are necessary. Something is lost by the fact that
to find assignments consistent with just the negation of the original evidence
node (used for bounding the error in the approximation algorithm), we do not
need any predecessors of the query nodes. The cost-sharing heuristic suffers
somewhat as a result (even though it is still admissible), as it is less optimistic.
To avoid this problem, one can always just use the multi-generator version of
the algorithm.
4  REDUCTION  TO  ILP


In [32], [35], [34], and [31], a method of converting the complete MAP problem
to an integer linear program (ILP) was shown. In [38] a similar method that
converts the problem of finding the IB MAP to a linear inequality system was
shown. We begin by reviewing the reduction, which is modified somewhat from
[38] in order to decrease the number of equations, and discuss the further changes
necessary to make the system find the next-best IB assignments.

The linear system of inequalities has a variable for each maximal IB hypercube.
The inequality generation is reviewed below. A belief network is denoted
by $B = (G, \mathcal{D})$, where $G$ is the underlying graph and $\mathcal{D}$ the distribution. We
usually omit reference to $\mathcal{D}$ and assume that all discussion is with respect to
the same arbitrary distribution. For each node $v$ and value $d$ in $D_v$ (the domain
of $v$), there is a set of $k_{v^d}$ maximal IB hypercubes based on $v$ (where $d \in D_v$).
We denote that set by $\mathcal{H}^{v^d}$, and assume some indexing on the set. Member $j$ of
$\mathcal{H}^{v^d}$ is denoted $H_j^{v^d}$, with $k_{v^d} \ge j \ge 1$.

A system of inequalities $L$ is a triple $(V, I, c)$, where $V$ is a set of variables,
$I$ is a set of inequalities, and $c$ is an assignment cost function.

Definition 1 From the belief network $B$ and the evidence $\mathcal{E}$, we construct a
system of inequalities $L = L_{IB}(B, \mathcal{E})$ as follows:

1. $V$ is a set of variables $h_i^{v^d}$, indexed by the set of all evidentially supported
maximal hypercubes $\mathcal{H}_{\mathcal{E}}$ (the set of hypercubes $H$ such that if $H$ is based
on $w$, then $w$ is evidentially supported). Thus, $V = \{h_i^{v^d} \mid H_i^{v^d} \in \mathcal{H}_{\mathcal{E}}\}$.

2. $c(h_i^{v^d}, 1) = -\log P(H_i^{v^d})$, and $c(h_i^{v^d}, 0) = 0$.

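3. $I$ is the following collection of inequalities:

(a) For each triple of nodes $(v, x, y)$ s.t. $x \neq y$ and $v \in \mathrm{parents}(x) \cap
\mathrm{parents}(y)$, and for each $d \in D_v$:

$$\sum_{i, e \,:\, (v,d) \in H_i^{x^e}} h_i^{x^e} \;+ \sum_{j, f, d' \neq d \,:\, (v,d') \in H_j^{y^f}} h_j^{y^f} \;\le\; 1 \qquad (2)$$

(b) For each evidentially supported node $v$ that is not a query node and
is not in $\mathrm{span}(\mathcal{E})$:

$$\sum_{d \in D_v} \sum_{i=1}^{k_{v^d}} h_i^{v^d} \;\le\; 1 \qquad (3)$$

(c) For each pair of nodes $w, v$ such that $v \in \mathrm{parents}(w)$, and for each
value $d \in D_v$:

$$\sum_{d' \in D_w} \; \sum_{i \,:\, (v,d) \in H_i^{w^{d'}}} h_i^{w^{d'}} \;\le\; \sum_{i=1}^{k_{v^d}} h_i^{v^d} \qquad (4)$$

(d) For each $(v, d) \in \mathcal{E}$:

$$\sum_{i=1}^{k_{v^d}} h_i^{v^d} = 1 \qquad (5)$$

(e) For each query node $q$:

$$\sum_{d \in D_q} \sum_{i=1}^{k_{q^d}} h_i^{q^d} = 1 \qquad (6)$$

(Footnote: The superscript $v^d$ states that node $v$ is assigned value $d$ by the hypercube (which is based
on $v$), and the subscript $i$ states that this is the $i$th hypercube among the hypercubes based
on $v$ that assign the value $d$ to $v$.)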

The  intuition  behind  these  inequalities  is  as  follows:  inequalities  of  type  a 
enforce  consistency  of  the  solution.  Type  b  inequalities  enforce  selection  of  at 
most  a  single  hypercube  based  on  each  node.  Type  c  inequalities  enforce  the 
IB  constraint,  i.e.  at  least  one  hypercube  based  on  v  must  be  selected  if  v  is 
assigned.  Type  d  inequalities  introduce  the  evidence,  and  type  e  introduces  the 
query  nodes.  Modifications  from  [38]  include  imploding  several  type  a  equations 
into  one,  reducing  the  number  of  such  equations  by  roughly  a  factor  quadratic 
in the number of hypercubes per node. Other modifications are making type b
and  d  into  equalities  to  make  a  simpler  system,  and  adding  the  equations  for 
the  previously  unsupported  query  nodes. 

Following [32], we define an assignment $s$ for the variables of $L$ as a function
from $V$ to $\mathbb{R}$. Furthermore:

1.  If  the  range  of  s  is  in  {0,  1}  then  s  is  a  0-1  assignment. 

2.  If  s  satisfies  all  the  inequalities  of  types  a-d  then  s  is  a  solution  for  L. 

3.  If  solution  s  for  L  is  a  0-1  assignment,  then  it  is  a  0-1  solution  for  L. 

The  objective  function  to  optimize  is: 

$$\Theta_{L_{IB}}(s) = -\sum_{h_i^{v^d} \in V} s(h_i^{v^d}) \log P(H_i^{v^d}) \qquad (7)$$

In [38] it was shown that an optimal 0-1 solution to the system of inequalities
induces an IB MAP on the original belief network. The minor modifications
introduced here, while having a favorable effect on the complexity, encode the
same constraints and thus do not affect the problem equivalence results of [38].
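To make the role of the objective and the constraint types concrete, here is a small sketch that finds an optimal 0-1 solution by exhaustive enumeration. It is a stand-in for an ILP solver, not the method used in the paper, and everything in it is assumed for illustration: a toy network (A with single child B, evidence B = 1), four hypothetical hypercube variables with invented probabilities, and constraints written from the stated intuition for types b, c, and d.

```python
import itertools
import math

# Hypothetical hypercube probabilities P(H) for a toy network A -> B.
PROB = {"hA0": 0.7, "hA1": 0.3, "hB1_1": 0.9, "hB1_2": 0.2}
NAMES = sorted(PROB)

# Each constraint: (coefficients, relation, right-hand side).
CONSTRAINTS = [
    ({"hA0": 1, "hA1": 1}, "<=", 1),      # type b: at most one hypercube based on A
    ({"hB1_1": 1, "hA1": -1}, "<=", 0),   # type c: HB1_1 assigns A = 1, so A = 1 needs support
    ({"hB1_2": 1, "hA0": -1}, "<=", 0),   # type c: HB1_2 assigns A = 0, so A = 0 needs support
    ({"hB1_1": 1, "hB1_2": 1}, "==", 1),  # type d: evidence B = 1
]


def satisfied(s):
    """True if the 0-1 assignment s meets every constraint."""
    for coeffs, rel, rhs in CONSTRAINTS:
        lhs = sum(c * s[name] for name, c in coeffs.items())
        if (rel == "<=" and lhs > rhs) or (rel == "==" and lhs != rhs):
            return False
    return True


def objective(s):
    """Sum of -log P(H) over the selected hypercubes."""
    return -sum(s[name] * math.log(PROB[name]) for name in NAMES)


def best_01_solution():
    """Enumerate all 0-1 assignments and keep the cheapest feasible one."""
    best = None
    for bits in itertools.product((0, 1), repeat=len(NAMES)):
        s = dict(zip(NAMES, bits))
        if satisfied(s) and (best is None or objective(s) < objective(best)):
            best = s
    return best


solution = best_01_solution()
```

Enumeration is exponential in the number of variables, so it only serves to check a formulation on tiny instances.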
If the optimal solution of the system happens to be 0-1, we have found the
IB MAP. Otherwise, we need to branch: select a variable $h$ which is assigned
a non 0,1 value, and create two sets of inequalities (subproblems), one with
$h = 1$ and the other with $h = 0$. Each of these now needs to be solved for an
optimal 0-1 solution, as in [35]. This branch and bound algorithm may have to
solve an exponential number of systems, but in practice that is not the case.
Additionally, the subproblems are always smaller in number of equations or
number of variables.

To create a subproblem, $h$ is clamped to either 0 or 1. The equations can now
be further simplified: a variable clamped to 0 can be removed from the system.
For a variable clamped to 1, the following reductions take place: find the type
b inequality, the type d equation (if any), and all the type a inequalities in
which $h$ appears. In each such inequality clamp all the other variables to 0
(removing them from the system), and delete the inequality. After doing the
above, check to see if any inequality contains only one variable, and if so clamp
it accordingly. For example, if a type d equation has only one variable, clamp
it to 1. Repeat these operations until no more reductions can be made.
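The reduction loop described above can be sketched as a small propagation routine. This is an illustration under simplifying assumptions, not the authors' solver: constraints are reduced to two hypothetical kinds, `"le1"` (sum of variables at most 1, as in types a and b) and `"eq1"` (sum equal to 1, as in type d); type c inequalities are not handled, and infeasibility is not detected.

```python
def clamp_to_one(h, constraints):
    """Clamp variable h to 1 and propagate the consequences.

    constraints: list of (kind, frozenset of variable names), where kind is
    "le1" (sum <= 1) or "eq1" (sum == 1).
    Returns (clamped values, remaining constraints).
    """
    values = {}
    pending = [(h, 1)]
    remaining = list(constraints)
    while pending:
        var, val = pending.pop()
        if var in values:
            continue  # already clamped; conflicts are not detected in this sketch
        values[var] = val
        kept = []
        for kind, variables in remaining:
            if var not in variables:
                kept.append((kind, variables))
            elif val == 1:
                # Constraint is used up: all its other variables must be 0,
                # and the constraint itself is deleted.
                pending.extend((other, 0) for other in variables - {var})
            else:
                reduced = variables - {var}  # a variable clamped to 0 is removed
                if kind == "eq1" and len(reduced) == 1:
                    pending.append((next(iter(reduced)), 1))  # forced to 1
                if kind == "eq1" or len(reduced) > 1:
                    kept.append((kind, reduced))  # single-variable "le1" is trivial
        remaining = kept
    return values, remaining


example = [
    ("eq1", frozenset({"x1", "x2"})),
    ("eq1", frozenset({"x2", "x3"})),
    ("le1", frozenset({"x3", "x4"})),
    ("le1", frozenset({"x5", "x6"})),
]
values, remaining = clamp_to_one("x1", example)
```

In the example, clamping `x1` to 1 forces `x2` to 0, which forces `x3` to 1, which forces `x4` to 0; only the unrelated constraint on `x5` and `x6` survives.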

Once  the  optimal  0-1  solution  is  found,  we  need  to  add  an  equation  prohibit¬ 
ing  that  solution,  and  then  to  find  an  optimal  solution  to  the  resulting  set  of 
equations. 

Let $S$ be the set of nodes in the IB assignment $A$ induced by the optimal
0-1 solution. To update the system, add the following equation:

$$\sum_{v \in S} \sum_{i} h_i^{v^{d_v}} \;\le\; |S| - 1$$

where $d_v$ denotes the value that $A$ assigns to $v$. This equation prevents any
solution which induces an assignment $B$ s.t. the variables in $S$ are assigned the
same values as in $A$. Thus, it is not just a recurrence of $A$ that is prohibited, but
of any assignment $B$ subsumed by $A$, in which case we would also like to ignore $B$.
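One reading of this update, assuming the added constraint bounds by $|S| - 1$ the number of selected hypercubes that agree with $A$ on the node they are based on, can be sketched as follows. The variable names and the `based_on` mapping are hypothetical, and the sketch relies on at most one hypercube being selected per node.

```python
def exclusion_constraint(assignment, based_on):
    """Build (variables, bound) for the constraint sum(variables) <= bound.

    assignment: dict node -> value, the IB assignment A to prohibit.
    based_on:   dict variable name -> (node, value) of its hypercube.
    """
    variables = sorted(
        h for h, (node, value) in based_on.items() if assignment.get(node) == value
    )
    return variables, len(assignment) - 1


def violates(solution, constraint):
    """True if a 0-1 solution breaks the exclusion constraint."""
    variables, bound = constraint
    return sum(solution[h] for h in variables) > bound


# Hypothetical toy system: two hypercubes based on A, two based on B.
BASED_ON = {
    "hA0": ("A", 0),
    "hA1": ("A", 1),
    "hB1_1": ("B", 1),
    "hB1_2": ("B", 1),
}
cut = exclusion_constraint({"A": 1, "B": 1}, BASED_ON)
previous = {"hA0": 0, "hA1": 1, "hB1_1": 1, "hB1_2": 0}
alternative = {"hA0": 1, "hA1": 0, "hB1_1": 0, "hB1_2": 1}
```

Here `previous` (which induces A = 1, B = 1) violates the new constraint, while `alternative` (A = 0, B = 1) still satisfies it.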

5  Preliminary  Experimental  Results 

As shown in experiments in [38], finding highest probability IB assignments
is feasible for up to medium-size diagrams, even with the current probability
heuristic. However, the method begins to deteriorate rapidly starting at 100
nodes. Hence, we turn to the cost sharing and linear programming approaches.

Preliminary results show that our constraints approach can solve for the IB
MAP in networks of up to 2000 nodes. Figure 1 compares the timing results of
the linear programming approach on 50 networks each consisting of 2000 nodes,
with the current cost and shared cost methods. For these problem instances,
cost sharing usually did much better than ILP, which in turn did much better
than current cost. However, we expect that on larger diagram sizes, ILP will
do better than cost sharing, which we intend to confirm in the near future.
For the most part, we found our ILP solutions relatively quickly. We would
like to note, though, that our package for solving integer linear programs was
crudely constructed by the authors without additional optimizations such
as sparse systems, etc. Furthermore, much of our computational process is
naturally parallelizable and should benefit immensely from techniques such as
parallel simplex [18] and parallel ILP [2, 3].

(Footnote: There was one problem instance (not shown) for which cost-sharing took so long (and
so much memory) that it crashed the Lisp interpreter (which also happened for several of the
“failed” cases for current cost).)

[Figure 1: 2000 node networks. Plot titled “2000 Nodes (Timing)”; axis label: CPU seconds, scaled by a power of ten.]
6  Discussion


Having  presented  several  IB  assignment  generators,  we  would  now  like  to  tie  the 
loose  ends  together  by  referring  to  the  problem  of  overlapping  IB  assignments. 
This  section  also  addresses  the  applicability  of  the  method,  i.e.  in  what  type 
of  networks  does  the  use  of  instantiation-based  independence  buy  us  anything. 
We  conclude  this  section  by  a  comparison  to  related  work. 

6.1  Treatment  of  Overlapping  Assignments 

There are several ways to handle overlapping assignments, ranging from avoiding
overlaps, to approximating the inclusion-exclusion formula. Poole suggests
avoiding this problem altogether by not generating any overlaps in the explanation
extension stage. In [27], marginal probabilities are computed by adding
up the probability mass in disjoint explanations, which are a special case of
IB assignments. Disjointness is achieved by making sure, whenever several extended
explanations are generated from a single partial explanation, that the
extended explanations are disjoint. To do that, whenever a set of candidate
explanations $A$ is considered for a proposition $b$ (where the database has rules
of the form $b \leftarrow a_i$, for each $a_i \in A$), the extended explanations will consist of
the set:

$$\{\{b, a_1\},\ \{b, a_2, \neg a_1\},\ \ldots,\ \{b, a_{|A|}, \neg a_1, \ldots, \neg a_{|A|-1}\}\}$$

in which all the explanations are clearly pairwise (as well as globally) disjoint.
Unfortunately, when translated into Bayes net format, such rule sets are OR
nodes, and it is not clear how to handle other types of nodes. Poole does explain
how a Bayes net might be represented in this scheme. It turns out, however,
that negating a variable in the explanation maps into a non-trivial constraint
in the network. In addition, it may not be desirable to do this anyway. What
if the explanation $\{b, a_{|A|}\}$ turns out to be by far the most likely overall? In
Poole’s scheme, we have eliminated much of its probability mass early on, and
there is no reasonable way to reorder the explanations so that, say, $\{b, a_{|A|}\}$ comes first.

The way Poole’s algorithm works, the ordering is imposed based on some heuristic
(e.g. the causal strength of the rules), but there is no guarantee that what
appears best at mid-run will indeed be a global optimum. For these reasons,
we do not employ the above method of avoiding overlaps. Our solutions are
based on deferring the decision as to which explanation (IB assignment) is best,
and then (if necessary) preventing successive IB assignments from overlapping
previous ones. At the stage of the algorithm where we get the IB assignments,
we know their probability ordering exactly. It is possible to defer computation
of higher-order terms in the inclusion-exclusion formula. The probability
of these terms diminishes, and we could ignore them in the computation.
That is because low-probability assignments are going to be ignored in the
approximation algorithm anyway. Theoretically, this is a bad idea. As shown in
[25], we need a very large number of terms (the worst-case bound is given in [25]) to get
a good approximation of the inclusion-exclusion formula, in the general case.
Still, this might be feasible in a practical algorithm.
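The disjoint extension scheme attributed to Poole earlier in this subsection can be sketched in a few lines. The function and the literal encoding (a proposition name paired with a truth value) are hypothetical choices for illustration; the sketch only builds the sets and checks that they are pairwise mutually exclusive.

```python
def disjoint_extensions(b, candidates):
    """Return [{b, a1}, {b, a2, not a1}, ...] as sets of (proposition, truth value)."""
    result = []
    for i, a in enumerate(candidates):
        explanation = {(b, True), (a, True)}
        # Negate every earlier candidate so this explanation excludes the earlier ones.
        explanation |= {(earlier, False) for earlier in candidates[:i]}
        result.append(explanation)
    return result


def mutually_exclusive(e1, e2):
    """True if some proposition is asserted in one explanation and negated in the other."""
    return any((prop, not truth) in e2 for prop, truth in e1)


extensions = disjoint_extensions("b", ["a1", "a2", "a3"])
```

The last explanation carries the most negated literals, which is the effect discussed above: a candidate placed late in the ordering loses probability mass to the earlier ones.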

Nevertheless,  we  use  inclusion-exclusion  only  for  a  small  set  of  overlapping 
assignments,  and  prevent  the  occurrence  of  sets  of  overlapping  assignments  with 
cardinality strictly greater than some small integer constant $k$. The exact value
of $k$ depends on which algorithm variant we use. In the best-first heuristic
search algorithms, it is hard to prevent an IB assignment from overlapping all
other assignments, and we thus use $k = 3$. If an IB assignment $A$ comes up
that  is  not  subsumed  by  some  previously  generated  IB  assignment  (in  which 
case  it  is  thrown  out),  we  do  the  following  test.  If  A  overlaps  more  than  k  IB 
assignments,  we  split  it  into  several  assignments  (which  are  not  necessarily  IB 
anymore),  and  toss  the  new  assignments  back  into  the  agenda.  The  details  of 
this  method  are  examined  elsewhere  [33].  In  the  ILP  version  of  the  algorithm, 
it  is  much  easier  to  prevent  an  overlap,  and  thus  we  can  use  k  =  0,  which  means 
that  no  overlapping  IB  assignments  are  generated.  See  [33]  for  details. 
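For a small set of overlapping assignments, the inclusion-exclusion computation can be sketched as below. This is a generic illustration, not the procedure of [33]: the `prob` argument is a hypothetical callable returning the probability of a partial assignment, and the demonstration uses independent fair binary variables, for which that probability is simply a product. In a belief network the merged assignment is not necessarily IB, so computing its probability is generally harder.

```python
from itertools import combinations


def merge(assignments):
    """Union of partial assignments (dicts), or None if they conflict."""
    merged = {}
    for a in assignments:
        for var, val in a.items():
            if merged.setdefault(var, val) != val:
                return None
    return merged


def union_mass(assignments, prob, max_order=None):
    """Probability that at least one assignment holds, by inclusion-exclusion.

    Terms of order above max_order are dropped, giving a truncated approximation.
    """
    if max_order is None:
        max_order = len(assignments)
    total = 0.0
    for r in range(1, max_order + 1):
        sign = 1 if r % 2 else -1
        for subset in combinations(assignments, r):
            merged = merge(subset)
            if merged is not None:  # conflicting assignments have empty intersection
                total += sign * prob(merged)
    return total


def fair_coin_prob(assignment):
    """Hypothetical model: independent fair binary variables."""
    return 0.5 ** len(assignment)


overlapping = union_mass([{"X": 1}, {"Y": 1}], fair_coin_prob)
first_order = union_mass([{"X": 1}, {"Y": 1}], fair_coin_prob, max_order=1)
disjoint = union_mass([{"X": 1}, {"X": 0}], fair_coin_prob)
```

Truncating at first order overcounts the overlap (1.0 instead of 0.75 in the example), which illustrates why higher-order terms cannot always be ignored.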

6.2  Compactness  of  Hypercube  Representation 

Central  to  all  of  the  algorithms  for  approximating  marginal  probabilities  by  enu¬ 
merating  IB  assignments  is  the  number  of  maximal  IB  hypercubes  representing 
a  conditional  distribution.  The  number  of  assignments  generated  in  each  step 
of the loop for the search-based algorithm is some fraction of the number of
hypercubes per node. In the ILP scheme, the number of equations as well as the
number  of  terms  per  equation  also  depends  on  the  number  of  hypercubes.  The 
issue  of  how  many  hypercubes  are  needed  to  represent  a  conditional  distribution 
in  a  network  is  therefore  paramount. 

In our experiments, random network distributions were generated by splitting
a hypercube into subcubes with some probability $p$. One may ask if this
represents  a  typical  case  of  Bayesian  networks  in  applications.  We  noted  in  [38] 
that  dirty  OR  nodes  (as  well  as  pure  OR  nodes,  AND  nodes,  etc.)  are  com¬ 
pactly  representable  as  maximal  IB  hypercubes  (2k  hypercubes  for  a  k  parent 
dirty  OR  node).  However,  a  much  more  commonly  encountered  type  of  node  is 
the  noisy  OR.  It  turns  out  that  if  a  noisy  OR  is  represented  as  a  single  node 
(with  a  single  distribution  array),  its  maximal  hypercube  representation  is  of 
size $2^k$, which is certainly not compact.

If we use δ-IB hypercubes [38], and the noisy OR has high causal strength
links, again 2k maximal δ-IB hypercubes will suffice. However, it is unclear
how δ-IB assignments can be used for approximating marginal probabilities
without severely impairing the precision of the approximation algorithm (the
probability of a δ-IB assignment can only be approximated in linear time, but
not computed exactly). It turns out, however, that if a noisy OR is represented in
its causal independence form [1, 26], with extra nodes added between the causes
and the noisy OR node itself, things are much improved. In this representation,
the noisy OR is translated into a pure OR, and can be represented with 2k
hypercubes. The additional nodes require another 4k hypercubes. And, in fact,
if zero probability hypercubes are dropped (they will never participate in any
useful IB assignment), a total of 3k hypercubes suffice to describe a noisy OR.
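The counts quoted above can be tabulated to see how quickly the single-node representation grows. The function below is a hypothetical helper that only restates the figures from the text ($2^k$, $2k + 4k$, and $3k$) for a noisy OR with $k$ parents; it does not enumerate hypercubes.

```python
def noisy_or_hypercube_counts(k):
    """Number of maximal IB hypercubes for a k-parent noisy OR, per the text."""
    return {
        "single_node": 2 ** k,                  # one node with a single distribution array
        "causal_independence": 2 * k + 4 * k,   # pure OR plus the added intermediate nodes
        "zero_probability_dropped": 3 * k,      # after dropping zero-probability hypercubes
    }


counts_for_10_parents = noisy_or_hypercube_counts(10)
```

For ten parents the single-node form already needs over a thousand hypercubes, while the causal independence form stays linear in $k$.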

Since the commonly used noisy OR is compactly representable, it is interesting
to ask whether other cases of causal independence can be represented compactly,
such as noisy MAX [29], as well as more general types of causal independence.
For deterministic binary valued nodes, this question reduces to the question: is
there a compact DNF representation for the function? Likewise, in the causal
independence representation, the deterministic part is separated out as a
deterministic function $f$ and $k$ noisy “channels”. If $f$ is compactly representable in
DNF, then the answer is affirmative. In the more general case of multi-valued
nodes, the issue is a bit more complicated, and is beyond the scope of this
paper. Suffice it to say that the Generalized IB hypercubes presented in [37]
can be used to better advantage with multi-valued nodes, since they allow the
aggregation of values in a node into a single “macro value”.

6.3  Related  and  Future  Work 

The  work  on  term  computation  [12]  and  related  papers  are  extremely  relevant 
to this paper. The skewness assumption made there, or a weaker version of it,
also makes our method applicable. In a sense, these methods complement each
other,  and  it  should  be  interesting  to  see  whether  IB  assignments  (or  at  least 
maximal  IB  hypercubes)  can  be  incorporated  into  a  term  computation  scheme. 

This  paper  enumerates  high  probability  IB  assignments  using  a  backward 
search  from  the  evidence.  [28]  also  enumerates  high  probability  assignments, 
but  using  a  top  down  (forward)  search.  Backward  constraints  are  introduced 
through  conflicts.  It  is  clear  that  the  method  is  efficient  for  the  example  do¬ 
main  (circuit  fault  analysis),  but  it  is  less  than  certain  whether  other  domains 
would  obey  the  extreme  probability  assumption  that  makes  this  work.  If  that 
assumption does not hold, it may turn out that backward search is still better.
On the other hand, it should be possible to take advantage of IB hypercubes
even  in  the  forward  search  approach  [33].  Note  that  among  the  algorithms  pre¬ 
sented  here,  the  current  probability  heuristic  ignores  forward  constraints,  while 
the  shared-cost  heuristic  does  employ  some  form  of  forward  reasoning  by  in¬ 
corporating  the  costs  from  the  top-down  initialization.  The  ILP  method  uses 
global  constraints  that  also  include  the  top-down  constraints,  but  what  role  the 
top-down  constraints  play  in  the  search  is  unclear. 

Several  stochastic  approximation  algorithms  find  the  MAP.  For  example, 
in  [16]  simulated  annealing  is  used.  It  is  not  clear,  however,  how  one  might 
use  it  either  to  enumerate  a  number  of  high-probability  assignments  or  make  it 
search  for  the  IB  MAP.  A  genetic  algorithm  for  finding  the  MAP  [30]  makes  a 
more interesting case. The authors in [30] note that the probability mass of the
population  rises  during  the  search  and  converges  on  some  value.  They  do  not  say 
whether  assignments  in  the  population  include  duplicates,  however,  and  make 
no  mention  of  the  possibility  of  approximating  marginal  probabilities  with  that 
population.  It  seems  likely  that  if  the  search  can  be  modified  to  search  among 
IB  assignments,  then  the  fact  that  a  whole  population  is  used,  rather  than  a 
single candidate, may provide a ready source of near-optimal IB assignments.
Of  course,  we  are  not  guaranteed  to  get  IB  assignments  in  decreasing  order  of 
probability,  so  slightly  different  methods  would  have  to  be  used  to  approximate 
the  marginal  probabilities. 

Finally,  it  should  be  possible  to  modify  the  algorithms  presented  in  this 
paper to work on GIB assignments and δ-IB assignments, where an even greater
probability  mass  is  packed  into  an  assignment  [38,  37].  Some  theoretical  issues 
will  have  to  be  resolved  before  we  can  do  that,  however. 

7  SUMMARY 

Computing  marginal  (prior  or  posterior)  probabilities  in  belief  networks  is  hard. 
Approximation schemes are thus of interest. Several deterministic approximation
schemes enumerate terms, or assignments to sets of variables, of high probability,
such that a relatively small number of them contain most of the probability
mass. This allows for an anytime approximation algorithm, whereby the
approximation improves as a larger number of terms is collected. IB assignments
are partial assignments that take advantage of local independencies not
represented by the topology of the network to reduce the number of assigned
variables, and hence increase the probability mass in each assignment.

What  remains  to  be  done  is  to  come  up  with  these  IB  assignments  in  a 
decreasing  order  of  probability.  This  is  also  a  hard  problem  in  general,  unfor¬ 
tunately.  The  factors  contributing  to  complexity,  however,  are  not  maximum 
clique  size  or  loop  cutset,  but  rather  the  number  of  hypercubes.  Under  prob¬ 
ability  skewness  assumptions,  the  search  for  high  probability  IB  assignments 
is  typically  more  efficient,  and  the  resulting  approximation  (collecting  a  small 
number  of  assignments)  is  better. 

Three algorithms for approximating marginal probabilities are presented: a
modification of a node-based best-first search algorithm for finding the IB MAP,
an edge-based best-first search algorithm with a cost-sharing heuristic, and an
algorithm based on linear systems of inequalities. We have also experimented on
highly connected diagrams where the conditional probabilities are represented as
sets of hypercubes (distribution arrays are precluded, since they are exponential
in size), and got favorable results in cases that the join-tree algorithm cannot
handle in practice.

Preliminary results show that using the cost-sharing heuristic improves
performance of the best-first search algorithm by more than one order of magnitude,
and that the algorithm based on linear systems of inequalities is still
faster. Naturally, more conclusive experiments will have to be performed.